# Full text of "DTIC ADA218092: Evaluation of Optimal Appropriateness Measurement for Use in Practical Settings"


AD-A218 092

AFHRL-TP-89-41

AIR FORCE

EVALUATION OF OPTIMAL APPROPRIATENESS MEASUREMENT FOR USE IN PRACTICAL SETTINGS

F. Drasgow, M. V. Levine, B. Williams, C. McCusker, G. L. Thomasson, R. G. Lim

Model Based Measurement Laboratory
University of Illinois
210 Education Building
1310 South Sixth Street
Champaign, Illinois 61820

MANPOWER AND PERSONNEL DIVISION
Brooks Air Force Base, Texas 78235-5601

January 1990

Final Technical Paper for Period October 1987 - June 1989

Approved for public release; distribution is unlimited.

LABORATORY, AIR FORCE SYSTEMS COMMAND, BROOKS AIR FORCE BASE, TEXAS 78235-5601

NOTICE

When Government drawings, specifications, or other data are used for any purpose other than in connection with a definitely Government-related procurement, the United States Government incurs no responsibility or any obligation whatsoever. The fact that the Government may have formulated or in any way supplied the said drawings, specifications, or other data is not to be regarded by implication, or otherwise in any manner construed, as licensing the holder, or any other person or corporation; or as conveying any rights or permission to manufacture, use, or sell any patented invention that may in any way be related thereto.

The Public Affairs Office has reviewed this paper, and it is releasable to the National Technical Information Service, where it will be available to the general public, including foreign nationals. This paper has been reviewed and is approved for publication.

JAMES A. EARLES, Contract Monitor
WILLIAM E. ALLEY, Technical Director, Manpower and Personnel Division
DANIEL L. LEIGHTON, Colonel, USAF, Chief, Manpower and Personnel Division

REPORT DOCUMENTATION PAGE

2. REPORT DATE: January 1990
3. REPORT TYPE AND DATES COVERED: Final - October 1987 to June 1989
4. TITLE AND SUBTITLE: Evaluation of Optimal Appropriateness Measurement for Use in Practical Settings
5. FUNDING NUMBERS: C - F41689-87-D-0012; PE - 62205F; PR - 2922; TA - 02; WU - 02
6. AUTHOR(S): F. Drasgow, M. V. Levine, B. Williams, C. McCusker, G. L. Thomasson, R. G. Lim
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): Model Based Measurement Laboratory, University of Illinois, 210 Education Building, 1310 South Sixth Street, Champaign, Illinois 61820
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES): Manpower and Personnel Division, Air Force Human Resources Laboratory, Brooks Air Force Base, Texas 78235-5601
10. SPONSORING/MONITORING AGENCY REPORT NUMBER: AFHRL-TP-89-41
12a. DISTRIBUTION/AVAILABILITY STATEMENT: Approved for public release; distribution is unlimited.

13. ABSTRACT: Optimal appropriateness indices were evaluated for use in applied settings. The first study demonstrated the adaptation of existing computer software to the problem of identifying cheaters with scores in specific test score ranges. It is shown that the Levine and Drasgow (1988) algorithm can be applied directly to this problem.
Then simulated and real data were used to determine the rates of detection of cheating on 5, 10, or 15 items on each of two tests while obtaining a total test score in one of two total test score ranges. As expected, low rates of identification were obtained for cheating on only 5 items per test, and reasonably high rates were obtained for cheating on 15 items. Moderately lower identification rates were obtained with real data than with simulation data. The extent of the difficulties that are likely to occur when the assumptions of optimal indices are violated was evaluated in the four robustness experiments that constituted Study Two. Optimal indices were found to be quite robust to misspecification of the ability density, the use of estimated item characteristic curves and option characteristic curves in place of the true (simulation) curves, and misspecification of the number of aberrant responses. On the other hand, substantially lower detection rates were obtained when local independence was violated by generating item responses with two correlated (.70) traits but computing optimal indices with the incorrect assumption of a unidimensional latent trait space. Some implications of these results are presented in the final section of this paper.

14. SUBJECT TERMS: Armed Services Vocational Aptitude Battery; appropriateness measurement; aptitude test
17. SECURITY CLASSIFICATION OF REPORT: Unclassified
18. SECURITY CLASSIFICATION OF THIS PAGE: Unclassified
19. SECURITY CLASSIFICATION OF ABSTRACT: Unclassified
20. LIMITATION OF ABSTRACT: UL

Reviewed by

Linda T. Curran, Acting Chief
Enlisted Selection and Classification Function

Submitted for publication by

Lonnie D. Valentine, Jr., Chief
Force Acquisition Branch

This publication is primarily a working paper. It is published solely to document work performed.

SUMMARY

The military services have a vital concern in assuring that aptitude test scores used for enlistment selection and classification are appropriate measures of applicants' true abilities. Substantial bonuses have been paid to examinees with sufficiently high scores as enticements to enlist into selected occupations. Also, failures in the services' training schools due to a lower aptitude than that necessary for successful completion cost thousands of dollars per individual. Therefore, cheating to improve scores on an enlistment test is a threat to the integrity of the services' selection and training systems.

The goal of appropriateness measurement is to identify individuals who have not been accurately assessed by a multiple-choice test and, therefore, preserve the integrity of the test. This effort investigated the utility of several appropriateness indices in identifying cheaters who were very low or who were just below average in verbal and quantitative aptitudes. The amount of cheating was 5, 10, or 15 items on tests of approximately 50 items in length. Real data as well as data simulated to maximize realism were used in the investigation.

Low rates of identification were obtained for cheating on 5 items. This was expected because on an item for which an examinee does not know the right answer, it is very difficult to distinguish a correct response due to cheating from a correct response due to a lucky guess. A small number of lucky guesses is not unusual.
Reasonably high rates of identification were obtained when cheating occurred on 15 items.

The above findings were based on (a) the sample having a normal ability distribution, (b) known probabilities of correct responses, (c) cheaters having a fixed and known number of compromised items, and (d) complete knowledge of which test items were verbal and which were quantitative. Some appropriateness indices worked reasonably well when actual examinee responding deviated from the first three conditions. Condition d cannot be violated; however, it is not necessary to develop a separate appropriateness measure for verbal and for quantitative aptitudes. A method for extending appropriateness measurement to two aptitude areas has already been developed and can be used when the items belonging to each aptitude area are designated.

It is concluded that the utilization of appropriateness indices for identification of examinees for retesting would be expected to improve the quality of a large testing program.

PREFACE

This effort was accomplished under Project 2922, Prototype Development and Validation of Selection and Classification Instruments. It represents the continuing effort of the Air Force Human Resources Laboratory to fulfill its research and development responsibilities through development and application of state-of-the-art methodologies in the area of enlisted selection and classification.

TABLE OF CONTENTS

I. INTRODUCTION
   Appropriateness Indices
II. STUDY ONE: TESTING SPECIFIC HYPOTHESES
   Purpose
   Likelihood Ratio
   Method
   Results
III. STUDY TWO: ROBUSTNESS OF OPTIMAL INDICES TO VIOLATIONS OF ASSUMPTIONS
   Purpose
   Method
   Results
IV. CONCLUSIONS AND DISCUSSION

REFERENCES

LIST OF TABLES

1. Selected Rates of Detection of Spuriously High Response Patterns with Total Test Scores in the 20th Through 24th Percentile, Simulation Data
2. Selected Rates of Detection of Spuriously High Response Patterns with Total Test Scores in the 20th Through 24th Percentile, Real Data
3. Selected Rates of Detection of Spuriously High Response Patterns with Total Test Scores in the 50th Through 54th Percentile, Simulation Data
4. Selected Rates of Detection of Spuriously High Response Patterns with Total Test Scores in the 50th Through 54th Percentile, Real Data
5. Selected Rates of Detection of Aberrant Response Patterns by the Likelihood Ratio Evaluated with True and Estimated Item Parameters
6. Selected Rates of Detection of Aberrant Response Patterns by the Likelihood Ratio Evaluated with Correct and Incorrect Assumptions about Dimensionality
7. Selected Rates of Detection of Aberrant Response Patterns by the Likelihood Ratio Evaluated with Correct and Misspecified Ability Densities
8. Selected Rates of Detection of Aberrant Response Patterns by the Likelihood Ratio Evaluated with Correct and Incorrect Specifications of the Number of Aberrant Responses

LIST OF FIGURES

1. Fit Plots for an Item Characteristic Curve and Three Conditional Option Characteristic Curves Obtained with the ForScore Computer Program
2. Density Functions of the Rescaled Chi-Square Distribution with Ten Degrees of Freedom and the Standard Normal Distribution

I. INTRODUCTION

Standardized psychological tests are administered to tens of millions of examinees per year. One test, the Armed Services Vocational Aptitude Battery (ASVAB), is administered to approximately 2.5 million examinees annually. The scores that result from standardized tests affect the lives of examinees by opening and closing doors to training programs, employment, and education.
Appropriateness measurement was proposed by Levine and Rubin (1979) as a means for identifying individuals who have been mismeasured by a standardized test. A general approach to specifying statistically optimal methods for this task was recently presented by Levine and Drasgow (1988). Their approach can be used to determine appropriateness indices that are optimal in the sense that no other statistic computed from the same data can provide higher rates of detection of the specified testing anomaly at the same false positive rate. Drasgow, Levine, and McLaughlin (1987, in press) and Drasgow, Levine, McLaughlin, and Earles (1987) compared optimal appropriateness indices to earlier, nonoptimal indices described by Drasgow, Levine, and Williams (1985), Rudner (1983), Sato (1975), Tatsuoka (1984), and Wright (1977). For unidimensional tests, they found that the best nonoptimal indices sometimes provided rates of detection of aberrant response patterns that were almost as high as the rates of optimal indices. In other cases, the best nonoptimal indices were far less powerful than optimal indices. Multi-test extensions of the nonoptimal indices were found to be less effective relative to multi-test optimal indices for a test battery consisting of two unidimensional tests. In this case, nonoptimal indices rarely provided rates of detection that were close to the detection rates of optimal indices.

A number of difficulties and uncertainties have limited applications of optimal appropriateness indices. To date, formulas for optimal indices have been derived for only a few types of mismeasurement. Some of the formulas that have been derived are quite complex. A considerable investment of time and effort has been necessary to develop and program algorithms for evaluating the complex formulas. Very little is known about the robustness of optimal indices to violations of their underlying assumptions. The research reported here was conducted in response to these problems.
In Study One, existing software was used to test specific hypotheses with optimal indices. The performance of optimal indices was evaluated and compared to nonoptimal indices. Study Two examined the robustness of optimal indices to four different violations of assumptions. Specifically, multidimensional item responses were analyzed with a unidimensional model; estimated item characteristic curves (ICCs) and option characteristic curves (OCCs) were used rather than the true ICCs and OCCs; ability parameters were sampled from a distribution related to the chi-square distribution with 10 degrees of freedom but optimal indices were computed assuming that ability was normally distributed; and optimal indices were computed for forms of aberrance (e.g., cheating on 20% of the test) that did not match the way aberrance was simulated (e.g., cheating on 30% of the test).

Appropriateness Indices

The primary focus of the research described in this paper is the evaluation of optimal appropriateness measurement. In the next subsection, a brief summary of optimal indices is provided; references to articles containing technical details are also given. Results for two non-optimal appropriateness indices were also obtained in Study One. The first of these two indices is the standardized $\ell_z$ index, which was described by Drasgow et al. (1985). The second non-optimal index, F2, is a standardized fit statistic given by Rudner (1983).

Optimal appropriateness indices. Levine and Drasgow (1988) showed that a most powerful appropriateness index for a given form of aberrance on a unidimensional test is the likelihood ratio (LR) statistic

$$LR(\mathbf{u}) = \frac{P_{\text{Aberrant}}(\mathbf{u})}{P_{\text{Normal}}(\mathbf{u})}. \qquad (1)$$

Here $P_{\text{Aberrant}}(\mathbf{u})$ denotes the likelihood of a vector of $n$ item responses $\mathbf{u} = (u_1, u_2, \ldots, u_n)$ given a specified form of aberrance, and $P_{\text{Normal}}(\mathbf{u})$ denotes the likelihood of $\mathbf{u}$ given the model of normal responding.

To illustrate $P_{\text{Normal}}(\mathbf{u})$ and $P_{\text{Aberrant}}(\mathbf{u})$, assume that the item responses are scored dichotomously, the test is unidimensional, $P_i(\theta)$ is the probability of a correct response to item $i$ by normal examinees with ability $\theta$, and the ability density is $f(\theta)$. Then the conditional likelihood of $\mathbf{u}$ is

$$P_{\text{Normal}}(\mathbf{u} \mid \theta) = \prod_{i=1}^{n} P_i(\theta)^{u_i} \left[1 - P_i(\theta)\right]^{1 - u_i} \qquad (2)$$

and the marginal likelihood is

$$P_{\text{Normal}}(\mathbf{u}) = \int P_{\text{Normal}}(\mathbf{u} \mid \theta)\, f(\theta)\, d\theta. \qquad (3)$$

Levine and Drasgow (1988) showed that $P_{\text{Aberrant}}(\mathbf{u})$ can also be computed as

$$P_{\text{Aberrant}}(\mathbf{u}) = \int P_{\text{Aberrant}}(\mathbf{u} \mid \theta)\, f(\theta)\, d\theta \qquad (4)$$

and presented methods that allow $P_{\text{Aberrant}}(\mathbf{u} \mid \theta)$ to be computed fairly easily. A very efficient method for approximating the quantity in Equation 4 was devised by Levine (in preparation; see Drasgow, Levine, & McLaughlin, in press, for an application). Although Levine's approximation was developed in the context of a multidimensional test battery, it can also be used for unidimensional tests. For a composite of two unidimensional tests, the likelihood is

$$\iint P(\mathbf{U}_1 = \mathbf{u}_1 \mid \theta_1)\, P(\mathbf{U}_2 = \mathbf{u}_2 \mid \theta_2)\, f(\boldsymbol{\theta})\, d\boldsymbol{\theta}, \qquad (5)$$

where $P(\mathbf{U}_j = \mathbf{u}_j \mid \theta_j)$ is the likelihood of the $n_j$ item responses $\mathbf{u}_j$ on test $j$, $j = 1, 2$, under either the normal or aberrant model. An interesting feature of Levine's approximation for either the unidimensional (Equation 4) or multidimensional (Equation 5) case is that the one- or two-dimensional integrals are evaluated without quadrature, thereby avoiding extremely intensive computations.

II. STUDY ONE: TESTING SPECIFIC HYPOTHESES

Purpose

Suppose a test administrator has the answer sheets from a set of examinees whose test scores just barely exceed a minimum threshold required to be hired, promoted, or admitted to a training program. Further, suppose it is known that some examinees earned their test scores honestly, while other examinees obtained the answers to some items prior to the exam and thus obtained passing scores by cheating.
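The quantities the administrator needs — the conditional likelihood of Equation 2, the marginal likelihoods of Equations 3 and 4, and the ratio of Equation 1 — can be sketched numerically. The sketch below uses hypothetical three-parameter logistic item parameters and plain numerical integration rather than Levine's quadrature-free approximation; the aberrance model (each item independently compromised at a fixed rate) is a deliberate simplification, not the exact Levine and Drasgow (1988) algorithm, which conditions on a fixed number of compromised items.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """Three-parameter logistic ICC: P_i(theta)."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

# quadrature grid for a standard normal ability density f(theta)
theta = np.linspace(-4.0, 4.0, 161)
weights = np.exp(-theta**2 / 2.0) / np.sqrt(2.0 * np.pi)
weights *= theta[1] - theta[0]

def marginal_likelihood(u, a, b, c, compromise_rate=0.0):
    """Equations 2-4: integrate the conditional likelihood over theta.
    With compromise_rate r > 0, each item is answered correctly with
    probability r + (1 - r) * P_i(theta), a simplified aberrance model."""
    like = np.ones_like(theta)
    for u_i, a_i, b_i, c_i in zip(u, a, b, c):
        p = compromise_rate + (1.0 - compromise_rate) * p_3pl(theta, a_i, b_i, c_i)
        like *= p if u_i == 1 else 1.0 - p
    return float(np.sum(like * weights))

def likelihood_ratio(u, a, b, c, compromise_rate):
    """Equation 1: LR(u) = P_Aberrant(u) / P_Normal(u)."""
    return (marginal_likelihood(u, a, b, c, compromise_rate)
            / marginal_likelihood(u, a, b, c, 0.0))
```

A large value of `likelihood_ratio` means the compromise model explains the pattern better than the normal model; in practice, the cutoff is set from the empirical distribution of the index in a normal sample.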
The task of the test administrator is to use each examinee's pattern of item responses to determine whether a passing score was obtained honestly.

Likelihood Ratio

The test administrator should use the likelihood ratio given in Equation 1 to decide whether a passing score was obtained honestly because no other statistic computed from the item responses provides more accurate classification. To apply Equation 1 to the problem faced by the test administrator, $P_{\text{Normal}}(\mathbf{u})$ would be interpreted as the likelihood of a response pattern $\mathbf{u}$ given that the examinee was responding honestly, and $P_{\text{Aberrant}}(\mathbf{u})$ would be interpreted as the likelihood of $\mathbf{u}$ given that the examinee was cheating. Stated simply, the likelihood ratio of Equation 1 compares the likelihood of $\mathbf{u}$ assuming that the examinee was cheating to the likelihood of $\mathbf{u}$ assuming that the examinee was honest; a large likelihood ratio suggests that the examinee was in fact cheating.

For the test administrator to use Equation 1, there must be an explicit means for evaluating its numerator and denominator. In this subsection, it is shown how existing software can be used for this purpose. A fact from elementary probability can be used to simplify the task of evaluating Equation 1. Specifically, suppose a set A is a subset of set B. Then

$$P(A \mid B) = P(A) / P(B). \qquad (6)$$

Equation 6 can be derived from the usual formula for conditional probability, $P(A \mid B) = P(A \text{ and } B) / P(B)$, because $P(A \text{ and } B) = P(A)$ when A is a subset of B.

Let $\omega$ denote the range of test scores that are subject to the test administrator's scrutiny. For example, $\omega$ might consist of the set of test scores that fall into the 50th to 54th percentiles. In addition, let $X$ be the function that maps item responses into test scores. If number-right scoring is used, for example, $X(\mathbf{u}) = u_1 + u_2 + \cdots + u_n$. Let $\mathbf{u}^*$ be a given sequence of responses such that $X(\mathbf{u}^*)$ is in $\omega$.
With this notation, the likelihood ratio that must be evaluated by the test administrator can be written as

$$LR(\mathbf{u}^*) = \frac{P_{\text{Aberrant}}(\mathbf{u} = \mathbf{u}^* \mid X(\mathbf{u}) \text{ is in } \omega)}{P_{\text{Normal}}(\mathbf{u} = \mathbf{u}^* \mid X(\mathbf{u}) \text{ is in } \omega)}. \qquad (7)$$

Applying Equation 6 to Equation 7 produces

$$LR(\mathbf{u}^*) = \frac{P_{\text{Aberrant}}(\mathbf{u} = \mathbf{u}^*) / P_{\text{Aberrant}}(X(\mathbf{u}) \text{ is in } \omega)}{P_{\text{Normal}}(\mathbf{u} = \mathbf{u}^*) / P_{\text{Normal}}(X(\mathbf{u}) \text{ is in } \omega)} = \frac{P_{\text{Aberrant}}(\mathbf{u} = \mathbf{u}^*)}{P_{\text{Normal}}(\mathbf{u} = \mathbf{u}^*)} \cdot k, \qquad (8)$$

where $k$ is a constant and thus can be ignored by the test administrator. Of course, this formula (and the specific $k$) is valid only for patterns $\mathbf{u}^*$ with $X(\mathbf{u}^*)$ in $\omega$. For such $\mathbf{u}^*$, $P_{\text{Normal}}(\mathbf{u} = \mathbf{u}^*)$ can now be evaluated by Equations 2 and 3, and $P_{\text{Aberrant}}(\mathbf{u})$ can be evaluated by Equation 4 and the methods described by Levine and Drasgow (1988) and Drasgow, Levine, and McLaughlin (in press).

Method

Overview. A study was conducted to examine the performances of optimal and non-optimal appropriateness indices on the task faced by the hypothetical test administrator. Both real and simulated data were analyzed in the study. The results obtained from the analysis of simulated data provide information about the performance of appropriateness indices under idealized conditions where all model assumptions are satisfied; the analysis of real data provides information about the indices' performances in operational conditions.

Data were generated to simulate normal responding to a test battery consisting of a test of verbal ability (V) and a test of quantitative ability (Q). In addition, data from presumably normal examinees responding to verbal and quantitative tests were analyzed. Response patterns with total test scores (V+Q) falling into two score ranges (20th through 24th percentiles and 50th through 54th percentiles) were selected. Compromise samples were formed by modifying either simulated response patterns or actual response patterns to simulate individuals who obtained total scores in the two score ranges by cheating.
Appropriateness indices were computed for all response patterns, and rates of identification of the simulated cheaters were determined at various false positive rates.

The real data set, item characteristic curves, and option characteristic curves. The real data used in this study were from a sample of 13,571 examinees who responded to the ASVAB, version 17A, under operational conditions. To estimate item parameters, 3,392 examinees were chosen by selecting examinees 1, 5, 9, ..., 13,569. A verbal test of 50 items was formed by combining the 35-item Word Knowledge test and the 15-item Paragraph Comprehension test. A quantitative test was formed by combining the 30-item Arithmetic Reasoning test and the 25-item Mathematics Knowledge test. The quantitative test contained 54 items after one Arithmetic Reasoning item was deleted because it was very easy (its item difficulty parameter was not accurately estimated). Three-parameter logistic item characteristic curves were estimated by the method of marginal maximum likelihood with the BILOG (Mislevy & Bock, 1984) computer program. Non-parametric estimates of ICCs and option characteristic curves based on Levine's (1985, 1989a, 1989b) Multilinear Formula Score (MFS) theory were obtained using the ForScore computer program (Williams & Levine, in preparation). Additional details about the non-parametric estimates were given by Lim, Williams, McCusker, Mead, Thomasson, Drasgow, and Levine (1989). The estimated three-parameter logistic ICCs and the estimated non-parametric ICCs and OCCs were used in all subsequent analyses of the real data.

To maximize the realism of the simulation portion of this study, ICCs and OCCs that had been estimated from the ASVAB data set were used as the "true" (i.e., simulation) ICCs and OCCs rather than an arbitrarily specified set of item parameters. This choice of ICCs and OCCs increases the comparability of the results obtained from the simulation and real data.

Item response models.
In the portion of Study One that analyzed the actual ASVAB data, examinees' item responses were scored either dichotomously or polychotomously, and appropriateness indices were computed with either the three-parameter logistic ICCs or multilinear formula scoring ICCs and OCCs. Specifically, appropriateness indices were computed with the following item scoring and item response models:

1. dichotomously scored responses analyzed with three-parameter logistic ICCs;
2. dichotomously scored responses analyzed with multilinear formula scoring ICCs;
3. polychotomously scored responses analyzed with multilinear formula scoring ICCs and OCCs.

For the simulation portion of Study One, data were generated for each of the three conditions listed above (e.g., three-parameter logistic ICCs were used to generate dichotomous item responses). Appropriateness indices were then computed with the model used to generate each sample, which yielded analyses of simulated data that were parallel to the analyses of real data.

Percentiles. The following procedure was used to determine the total test scores corresponding to the 20th, 24th, 50th, and 54th percentiles for Study One. First, the estimated three-parameter logistic ICCs were used to generate 100,000 response patterns by the process for simulating normal response patterns (see below). Next, number-right scores were computed for each simulated verbal and quantitative test. Number-right scores on these two tests were then separately standardized, and a total score was computed as the sum of the two standardized scores. Finally, the frequency distribution of the total score was tabulated and used to determine the values of the total test score associated with specific percentiles.

Simulated normal response patterns.
For each of the three item response models listed above, a simulated normal response pattern (i.e., a non-cheater) was created by sampling $\boldsymbol{\theta} = (\theta_1, \theta_2)$ from the standardized bivariate normal distribution with correlation .7. $\theta_1$ was used with the simulation ICCs and OCCs for the verbal test to generate locally independent item responses. Similarly, $\theta_2$ was used to generate locally independent item responses for the quantitative test. Response patterns were repeatedly generated until 4,000 simulated examinees were collected for the low score range (20th through 24th percentiles) normal sample and for the moderate score range (50th through 54th percentiles) normal sample.

Real normal response patterns. Real normal response patterns were obtained by first selecting each response pattern that was not included in the sample used to estimate ICCs and OCCs (i.e., response patterns were taken from the magnetic tape containing 13,571 response patterns, but the 3,392 patterns used for item calibration were excluded). Next, a total test score was computed for each response pattern in the manner described previously. Response patterns with total test scores in either the low score range or the moderate score range were then written to separate files. A total of t(80 response patterns had total test scores in the 20th through 24th percentiles, and 533 response patterns had total test scores in the 50th through 54th percentiles.

Spuriously high manipulation applied to simulated data. Cheating was simulated by first generating a normal response pattern and then rescoring a fixed number of item responses to be correct, regardless of the original responses. The rescored items were randomly selected for each response pattern, and so Levine and Drasgow's (1988) method for evaluating $P_{\text{Aberrant}}(\mathbf{u})$ could be applied directly. Response patterns were generated with 5, 10, or 15 items per test rescored to simulate cheating.
This process was continued until 2,000 response patterns with total scores in the low score range and moderate score range were collected. An attempt was made to generate 18 samples by factorially crossing the three item response models, the three levels of simulated cheating (5, 10, or 15 items per test), and the two score ranges (20th through 24th percentiles and 50th through 54th percentiles); however, the 15-item spuriously high manipulation consistently produced response patterns with total scores that exceeded the 24th percentile. Consequently, it was possible to obtain only 15 spuriously high samples.

Spuriously high manipulation applied to real data. Only response patterns not used for item calibration and not in either normal sample were subjected to the spuriously high manipulations. The 5-, 10-, and 15-item spuriously high manipulations were applied to each of these response patterns, and a response pattern was selected if its total score fell in either the low or moderate score ranges. A total of 524, 635, and 654 response patterns were obtained for the moderate score range in the 5-, 10-, and 15-item spuriously high conditions. For the low score range, 408 and 310 response patterns were obtained in the 5- and 10-item conditions. Again, the 15-item spuriously high manipulation produced response patterns with test scores above the 24th percentile.

Analysis. Optimal appropriateness indices were computed for the samples of simulated and real normal response patterns using the Levine and Drasgow (1988) algorithm for spuriously high responding to 5, 10, and 15 items per test. Correctly specified optimal indices were always computed; for example, the optimal index for 10 spuriously high responses per test was computed for aberrant response patterns that had been subjected to this manipulation. The non-optimal indices were also computed for each normal and aberrant sample.
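The data-generation steps just described (bivariate normal abilities with correlation .7, locally independent dichotomous responses, and the spuriously high rescoring) can be sketched as follows. The item parameters here are illustrative placeholders, not the estimated ASVAB parameters.

```python
import numpy as np

rng = np.random.default_rng(12345)

def p_3pl(theta, a, b, c):
    """Three-parameter logistic ICC."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def simulate_normal_pattern(theta, a, b, c, rng):
    """Locally independent dichotomous responses given ability theta."""
    p = p_3pl(theta, a, b, c)
    return (rng.random(p.size) < p).astype(int)

def spuriously_high(u, n_items, rng):
    """Rescore n_items randomly chosen responses to correct (the cheating
    manipulation); items already answered correctly stay correct."""
    u = u.copy()
    u[rng.choice(u.size, size=n_items, replace=False)] = 1
    return u

# illustrative parameters: 50-item verbal test, 54-item quantitative test
a_v, b_v, c_v = np.ones(50), rng.normal(0, 1, 50), np.full(50, 0.2)
a_q, b_q, c_q = np.ones(54), rng.normal(0, 1, 54), np.full(54, 0.2)

# abilities for one simulated examinee: bivariate normal, correlation .7
theta_v, theta_q = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]])

u_v = simulate_normal_pattern(theta_v, a_v, b_v, c_v, rng)
u_q = simulate_normal_pattern(theta_q, a_q, b_q, c_q, rng)

# 10-items-per-test spuriously high manipulation
u_v_cheat = spuriously_high(u_v, 10, rng)
u_q_cheat = spuriously_high(u_q, 10, rng)
```

In the study itself, a generated (or manipulated) pattern was retained only if its standardized total score (V+Q) fell in the target percentile range; that screening step is omitted here.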
After computing appropriateness indices, receiver operating characteristic (ROC) curves were constructed. These curves depict the proportions of the response patterns in an aberrant sample that can be identified at various false positive rates. Of course, it is desirable to have a high detection rate (i.e., a high proportion of aberrant response patterns detected) at a low false positive rate.

Results

Rates of detection of simulated cheating for the low score range are presented in Table 1 for the simulated data. From Table 1 it is evident that simulated cheating on five items per test was very difficult to detect: only 26% of the simulated cheaters were detected by the most sophisticated analysis when the false positive rate was 5%. The optimal index computed for the three-parameter logistic model was able to identify just 25%. Table 1 shows that cheating on 10 items per test was much easier to identify; for example, the optimal index for the MFS analysis of polychotomously scored responses identified 67% of the simulated cheaters at a false positive rate of 5%. The detection rates were 61% and 60% when the responses were scored dichotomously and analyzed with MFS and three-parameter logistic optimal methods, respectively. Table 1 shows that the non-optimal $\ell_z$ and F2 indices had detection rates modestly below the detection rates of optimal indices for dichotomously scored responses. Their rates of detection rather substantially trailed the rates provided by the MFS optimal index for polychotomous scoring.

Table 2 presents results for actual ASVAB response patterns that had been modified to simulate individuals who obtained scores in the 20th through 24th percentiles by cheating. Comparing the results for simulation data summarized in Table 1 to the real data results in Table 2 shows generally lower detection rates for real data.
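For any single index, reading a detection rate off an ROC curve reduces to finding the cutoff that flags a chosen fraction of the normal sample and counting the flagged aberrants. A minimal sketch, assuming larger index values indicate aberrance:

```python
import numpy as np

def detection_rate(normal_scores, aberrant_scores, fpr):
    """Detection rate at a given false positive rate: find the index
    cutoff that flags a proportion `fpr` of the normal sample, then
    return the fraction of the aberrant sample exceeding that cutoff."""
    cutoff = np.quantile(normal_scores, 1.0 - fpr)
    return float(np.mean(aberrant_scores > cutoff))
```

Sweeping `fpr` from 0 to 1 traces out the full ROC curve; the tables in this section report the curve at a handful of false positive rates (.001, .01, .03, .05, .10).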
A word of caution is needed here: it was not possible to use samples of the size that ensures inconsequential sampling fluctuations (say, 4,000 normals and 2,000 aberrants) from the ASVAB data set. Thus, the numbers contained in Table 2 are subject to rather large sampling errors. (Candell and Levine, 1989, provide details about the expected sizes of sampling errors of ROC curves.) Two explanations for the lower detection rates in Table 2 are readily available. First, model misspecifications of various kinds may have had detrimental effects. This explanation was examined in Study Two, which was conducted to evaluate the consequences of a variety of misspecifications. A second explanation of the lower detection rates in Table 2 is that the normal sample used to determine false positive rates was not entirely normal. This sample, which consisted of actual ASVAB response patterns, might have contained a few truly aberrant response patterns. As one check of this latter hypothesis, the magnitudes of the likelihood ratios yielding 5% false positive rates were determined for the normal samples used in the simulation analyses and in the ASVAB analyses. Optimal indices were computed given the (incorrect) assumption that there were 10 spuriously high responses per test. The likelihood ratios are:

                             Poly. MFS   Dichot. MFS    3PL
  Simulation normal sample     2.10         2.21        2.19
  ASVAB normal sample          5.36         4.24        3.07

Table 1. Selected Rates of Detection of Spuriously High Response Patterns with
Total Test Scores in the 20th Through 24th Percentile, Simulation Data

 False Pos.        Polychot. MFS     Dichot. MFS            3PL
   Rate     Test     Optimal       Optimal  ℓ0   F2    Optimal  ℓ0   F2

 5 Spuriously High Responses Per Test
   .001      V         00            01     00   00      00     00   00
             Q         01            01     00   01      02     00   01
             MT        01            01     01   01      03     02   02
   .01       V         05            04     02   02      05     03   02
             Q         07            05     04   04      06     05   05
             MT        10            08     05   06      08     06   07
   .03       V         11            10     06   05      10     06   05
             Q         14            12     10   11      13     11   11
             MT        20            17     12   12      18     15   14
   .05       V         15            14     09   09      13     10   09
             Q         22            19     16   16      18     17   17
             MT        26            24     18   19      25     20   20
   .10       V         24            22     19   18      23     21   20
             Q         34            30     27   28      30     29   29
             MT        42            38     30   31      36     30   30

 10 Spuriously High Responses Per Test
   .001      V         05            04     03   01      03     02   02
             Q         04            05     04   04      08     03   04
             MT        10            15     10   10      19     11   09
   .01       V         17            13     10   06      15     09   07
             Q         25            19     17   15      19     18   15
             MT        44            34     25   26      37     26   26
   .03       V         28            23     19   14      24     20   17
             Q         42            35     32   32      33     31   29
             MT        59            51     40   39      52     44   43
   .05       V         35            29     24   21      31     27   23
             Q         52            45     40   39      43     39   38
             MT        67            61     50   49      60     51   50
   .10       V         46            41     38   34      42     39   38
             Q         67            61     56   55      60     55   55
             MT        79            74     64   64      73     64   62

Table 2. Selected Rates of Detection of Spuriously High Response Patterns with
Total Test Scores in the 20th Through 24th Percentile, Real Data

 False Pos.        Polychot. MFS     Dichot. MFS            3PL
   Rate     Test     Optimal       Optimal  ℓ0   F2    Optimal  ℓ0   F2

 5 Spuriously High Responses Per Test
   .001      V         00            00     00   00      00     01   00
             Q         02            00     00   01      01     00   00
             MT        01            00     00   00      02     00   00
   .01       V         04            03     01   00      02     01   01
             Q         03            04     02   03      03     04   05
             MT        05            06     02   02      07     01   02
   .03       V         10            10     04   02      09     04   02
             Q         07            05     05   06      08     08   07
             MT        14            13     06   07      14     10   07
   .05       V         14            15     08   04      14     07   03
             Q         15            10     11   08      13     11   10
             MT        19            16     15   14      20     17   17
   .10       V         26            21     12   12      21     13   11
             Q         24            21     21   19      25     24   21
             MT        32            33     24   22      31     26   21

 10 Spuriously High Responses Per Test
   .001      V         02            02     01   00      01     01   00
             Q         08            07     01   06      07     02   05
             MT        14            00     00   00      05     00   00
   .01       V         08            05     02   00      05     03   01
             Q         17            14     15   12      16     20   20
             MT        15            16     09   07      22     09   07
   .03       V         16            15     09   05      16     10   05
             Q         32            27     25   21      32     31   25
             MT        45            40     22   24      47     29   26
   .05       V         28            23     15   07      19     14   07
             Q         42            43     34   28      43     37   34
             MT        52            53     36   36      58     45   43
   .10       V         40            37     25   21      32     27   17
             Q         52            53     48   44      55     51   47
             MT        68            68     55   53      66     59   52

The likelihood ratio is the ratio of the likelihood of a response pattern given the model for aberrant responding (10 spuriously high responses per test) to the likelihood of the response pattern given the model for normal responding. A large likelihood ratio indicates that the model for aberrant responding "explains" the response pattern better than the normal model.
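The likelihood ratio just defined can be illustrated with a toy computation. This sketch is not the report's MFS machinery: the nine 3PL item parameters are invented, the quadrature is a crude grid, and the "aberrant" model simply treats every item as spuriously correct with some probability, under stated assumptions.

```python
# Toy likelihood ratio: marginal probability of a 0/1 response pattern
# under an aberrant model (items may be spuriously correct) divided by
# its marginal probability under the normal model.
import math

def p_correct(theta, a, b, c):
    """Three-parameter logistic ICC (fabricated parameters)."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

items = [(1.0, b / 2.0, 0.2) for b in range(-4, 5)]  # 9 items, easy to hard
thetas = [-3 + 0.1 * i for i in range(61)]           # crude quadrature grid

def marginal_likelihood(pattern, spurious_rate=0.0):
    """Average over theta; each item may be spuriously correct with spurious_rate."""
    total = 0.0
    for th in thetas:
        lik = 1.0
        for u, (a, b, c) in zip(pattern, items):
            p = spurious_rate + (1 - spurious_rate) * p_correct(th, a, b, c)
            lik *= p if u == 1 else (1 - p)
        total += lik
    return total / len(thetas)

# Suspicious pattern: hard items answered correctly, easy items missed.
pattern = [0, 0, 0, 1, 0, 1, 1, 1, 1]
lr = marginal_likelihood(pattern, spurious_rate=0.5) / marginal_likelihood(pattern)
print(f"likelihood ratio: {lr:.2f}")
```

A ratio well above 1 indicates the aberrant model explains the pattern better, which is exactly the diagnostic used in the text.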
The likelihood ratios shown above imply that the model for aberrant responding provides a good fit (relative to the model for normal responding) for more nominally normal ASVAB response patterns than simulation normal (and hence truly normal) response patterns. Note further that the optimal index is targeted for a specific form of aberrance (spuriously high responding), unlike goodness of fit indices such as ℓ0 and F2 that test for any departure from normal responding. Thus, these results are consistent with the hypothesis that some ASVAB examinees may have received coaching. Detection rates for simulated data with total test scores in the moderate score range are shown in Table 3. Again it was very difficult to identify response patterns that had been subjected to the five item per test spuriously high manipulation. One reason for this difficulty is that the version of the Levine and Drasgow (1988) algorithm used in this study makes no assumptions about which items were compromised; all items were assumed to be equally likely candidates for cheating. It seems likely that higher detection rates would be obtained if more were known about the relative likelihood of cheating on each item. For example, if new items introduced in a test administration or otherwise known to be secure can be assumed to have zero probability of spurious responses, then detection rates can be significantly increased by utilizing a more general version of the Levine and Drasgow algorithm. For another example, if the response options for some items are reordered because it is suspected that some examinees have memorized the answer key, the more general Levine and Drasgow (1988) algorithm can incorporate this additional information. Table 3 shows moderate detection rates for cheating on 10 items per test and high detection rates for cheating on 15 items per test. Specifically, the best index identified 70% of the cheaters in this latter condition when the false positive rate was 5%.
The detection rates for the two optimal indices computed with dichotomously scored responses were 62% and 62%. The non-optimal indices detected roughly 40% at a 5% false positive rate; the optimal index for polychotomous scoring achieved a somewhat higher detection rate at a false positive rate of only 1%. A generally similar pattern of results was obtained in the analysis of the actual ASVAB data. Table 4 shows that it is a difficult task to identify cheating by near average ability examinees on a small to moderate number of items (5 or 10 items). Even the best appropriateness indices detect no more than 30% of such response patterns at a 5% false positive rate. These aberrant response patterns are difficult to identify because a substantial number of items were answered correctly before the spuriously high manipulation was applied. Thus, the aberrance manipulation does not produce a particularly unusual response pattern (i.e., one with several correct answers to hard items juxtaposed with incorrect answers to easy items).

Table 3. Selected Rates of Detection of Spuriously High Response Patterns with
Total Test Scores in the 50th Through 54th Percentile, Simulation Data
 False Pos.        Polychot. MFS     Dichot. MFS            3PL
   Rate     Test     Optimal       Optimal  ℓ0   F2    Optimal  ℓ0   F2

 5 Spuriously High Responses Per Test
   .001      V         00            00     00   00      00     00   00
             Q         01            01     00   01      01     01   01
             MT        01            00     00   00      00     00   00
   .01       V         03            03     02   01      03     02   01
             Q         04            03     03   03      03     03   03
             MT        05            04     02   02      05     04   03
   .03       V         07            07     04   03      08     05   03
             Q         09            08     08   06      09     09   00
             MT        12            09     07   07      10     09   09
   .05       V         10            10     07   05      12     08   05
             Q         13            13     10   11      14     12   12
             MT        17            15     11   11      16     12   12
   .10       V         19            19     14   12      21     15   13
             Q         23            22     21   20      23     20   21
             MT        28            26     20   19      27     21   21

 10 Spuriously High Responses Per Test
   .001      V         01            01     00   00      01     01   00
             Q         03            03     01   01      02     02   01
             MT        05            01     02   01      02     01   01
   .01       V         07            06     02   01      05     03   01
             Q         14            08     07   06      10     10   08
             MT        18            12     06   07      14     08   07
   .03       V         13            12     06   04      13     07   04
             Q         26            20     16   15      21     20   19
             MT        32            27     14   15      26     18   18
   .05       V         18            17     10   07      18     10   06
             Q         32            28     21   21      29     26   25
             MT        41            36     21   23      35     24   25
   .10       V         28            28     17   14      27     19   15
             Q         46            42     36   36      44     37   37
             MT        56            52     33   34      49     36   38

 15 Spuriously High Responses Per Test
   .001      V         07            05     01   00      05     03   00
             Q         09            06     03   04      05     06   05
             MT        21            08     06   03      07     05   05
   .01       V         16            14     05   01      15     08   01
             Q         29            22     17   16      25     20   18
             MT        46            33     17   16      36     23   19
   .03       V         28            25     11   06      27     14   06
             Q         48            39     31   30      41     34   36
             MT        62            53     31   30      53     36   36
   .05       V         34            30     17   10      34     20   11
             Q         56            47     37   38      51     42   42
             MT        70            62     40   41      62     44   44
   .10       V         46            42     27   20      44     32   23
             Q         69            61     52   53      65     53   54
             MT        82            74     54   54      75     57   59

Table 4. Selected Rates of Detection of Spuriously High Response Patterns with
Total Test Scores in the 50th Through 54th Percentile, Real Data
 False Pos.        Polychot. MFS     Dichot. MFS            3PL
   Rate     Test     Optimal       Optimal  ℓ0   F2    Optimal  ℓ0   F2

 5 Spuriously High Responses Per Test
   .001      V         00            00     00   00      00     00   00
             Q         00            00     00   00      02     00   00
             MT        00            00     01   00      01     00   00
   .01       V         00            01     01   01      02     02   01
             Q         03            01     02   01      03     02   02
             MT        01            02     02   01      04     04   03
   .03       V         02            06     04   02      05     05   04
             Q         08            09     07   06      09     06   06
             MT        07            09     06   06      13     07   07
   .05       V         07            10     07   06      08     08   06
             Q         11            11     11   10      14     12   10
             MT        11            12     11   09      17     12   10
   .10       V         16            15     11   10      15     13   09
             Q         21            21     20   19      22     20   21
             MT        20            24     20   18      23     20   20

 10 Spuriously High Responses Per Test
   .001      V         00            00     00   00      00     00   00
             Q         02            02     00   00      06     01   00
             MT        02            02     00   01      03     01   01
   .01       V         01            02     01   00      03     01   00
             Q         06            08     05   03      09     05   04
             MT        07            09     06   04      12     08   05
   .03       V         06            08     04   03      10     07   05
             Q         18            18     14   14      19     13   15
             MT        16            22     12   13      22     14   15
   .05       V         12            15     08   06      13     10   07
             Q         23            21     20   18      24     21   19
             MT        26            28     19   18      30     20   19
   .10       V         22            23     14   10      23     16   11
             Q         36            32     31   29      34     31   31
             MT        41            37     29   29      40     33   32

 15 Spuriously High Responses Per Test
   .001      V         02            02     02   00      01     03   00
             Q         04            04     03   00      10     05   01
             MT        08            04     03   02      09     04   04
   .01       V         08            08     04   00      08     05   01
             Q         20            16     12   07      18     12   10
             MT        27            24     14   08      25     18   12
   .03       V         16            22     09   04      23     12   08
             Q         31            31     24   23      30     24   25
             MT        41            43     25   25      44     28   29
   .05       V         27            27     15   09      29     18   10
             Q         41            35     33   30      40     34   30
             MT        50            53     36   33      54     39   36
   .10       V         37            39     24   17      39     26   18
             Q         57            51     44   43      53     44   46
             MT        65            66     51   50      68     53   52

The rates of detection of response patterns subjected to the 15 item per test spuriously high manipulation are moderately high. For example, about 50% of these patterns are detected at a 5% false alarm rate. This higher detection rate is of course in part due to the severity of the manipulation. But an important additional ingredient is that, prior to the spuriously high manipulation, the response patterns were indicative of fairly low ability. Thus, the patterns contained some incorrect answers to easy items.
When the spuriously high manipulation resulted in correct answers to some of the harder items, detection of the simulated cheating was possible. Rates of detection are somewhat lower in Table 4 than in Table 3, which again may be due to one of the forms of model misspecification examined in Study Two or to the inclusion of truly aberrant response patterns in the nominally normal ASVAB sample. Likelihood ratios yielding a 5% false positive rate were determined for the ASVAB and simulation normal samples given the assumption of 10 spuriously high responses per test. The likelihood ratios are:

                             Poly. MFS   Dichot. MFS    3PL
  Simulation normal sample     4.10         3.94        3.86
  ASVAB normal sample          7.60         5.73        5.17

As with the lower ability range, the likelihood ratios suggest that some aberrant response patterns may have been included in the nominally normal ASVAB sample.

III. STUDY TWO: ROBUSTNESS OF OPTIMAL INDICES TO VIOLATIONS OF ASSUMPTIONS

Purpose

There are a variety of violations of the optimal indices' assumptions that could create problems in operational settings. These violations include:

1. the use of estimated ICCs and OCCs in place of the true ICCs and OCCs;
2. violations of local independence that surely occur in real data;
3. differences between the assumed ability density in Equation 5 and the true ability density.

In addition to these three forms of model misspecification, another kind of misspecification is sure to occur in operational settings. The Levine and Drasgow (1988) algorithm assumes that the number of spuriously high or spuriously low responses on each test is known. However, such information is not usually available when a test is administered to examinees who may have been coached in a variety of ways. Thus, a fourth model misspecification consists of violations of the assumed number of spurious responses per test. Each of these four model misspecifications was investigated in Study Two.
In each case, a misspecified index was computed in addition to the truly optimal index. Comparing the detection rates of the truly optimal index to those of the misspecified index shows the impact of the misspecification.

Method

Item characteristic curves and option characteristic curves. Although Study Two was entirely a simulation study, it was desirable to make the simulation as realistic as possible. For this reason, very accurate estimates of item and option characteristic curves were obtained for the ASVAB items from Study One. To this end, response patterns 1, 3, 5, ... were initially selected from the complete sample, yielding a total of 6,785 patterns. To reduce this sample to a more manageable size, but still obtain very accurate ICC and OCC estimates, some examinees with average abilities were excluded whereas all examinees with extreme abilities were retained. (Estimation of ICCs and OCCs is typically very accurate for moderate ability ranges, but far less accurate in extreme ability ranges.) To avoid systematically violating local independence, response patterns were excluded on the basis of their scores on the 35-item General Science (GS) test rather than the verbal or quantitative tests. Response patterns with GS number-right scores of 15, 17, or 19 were deleted. This left a sample of 5,301 patterns, as 503, 518, and 463 patterns had scores of 15, 17, and 19, respectively. As in Study One, marginal maximum likelihood estimates of the item parameters of the three-parameter logistic model were obtained with the BILOG (Mislevy & Bock, 1984) computer program, and non-parametric estimates of ICCs and OCCs based on Levine's (1985, 1989a, 1989b) MFS theory were obtained with the ForScore computer program. Fit plots showed very accurate modeling of empirical proportions for the multilinear formula scoring ICCs and OCCs.
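A fit plot of the kind just mentioned can be approximated with a simple grouping check. The sketch below is hypothetical (simulated responses and a made-up logistic "fitted" curve, not the report's ForScore output): examinees are grouped by ability, and the observed proportion answering correctly in each group is compared with the fitted ICC.

```python
# Fit-plot sketch: observed proportion correct per ability group versus
# the fitted ICC evaluated at the group's median ability.
import math, random

def fitted_icc(theta, a=1.0, b=0.0, c=0.2):
    """Fabricated 3PL curve standing in for a fitted ICC."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

rng = random.Random(3)
data = []
for _ in range(5000):  # simulate (theta, response) pairs from the curve itself
    th = rng.gauss(0, 1)
    data.append((th, 1 if rng.random() < fitted_icc(th) else 0))

data.sort()            # group examinees into theta deciles
size = len(data) // 10
for g in range(10):
    chunk = data[g * size:(g + 1) * size]
    mid_theta = chunk[len(chunk) // 2][0]
    observed = sum(u for _, u in chunk) / len(chunk)
    print(f"theta ~ {mid_theta:+.2f}: observed {observed:.2f}  fitted {fitted_icc(mid_theta):.2f}")
```

When the model fits, the observed and fitted columns track each other closely across the ability range, which is what the report's fit plots display graphically.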
Figure 1 shows a typical fit plot; the multilinear formula scoring estimate of the ICC is given by the dashed line in the upper left panel; the solid lines in the other three panels show conditional OCCs (OCCs divided by [1 − P_i(θ)]) for the three incorrect options.

Figure 1. Fit Plots for an Item Characteristic Curve (Item 13) and Three Conditional Option Characteristic Curves Obtained with the ForScore Computer Program.

Samples and analyses. The following general process was used to evaluate the effects of each of the four forms of misspecification described above. First, a normal sample of 4,000 response patterns was generated with the ICCs and OCCs described above. Then (except for the misspecified aberrance condition) two samples of 2,000 aberrant response patterns were generated, again with the ICCs and OCCs estimated from the sample of 5,301. One sample contained normal response patterns that had been subjected to the 10 item per test spuriously high manipulation, and the other sample contained patterns subjected to the 10 item per test spuriously low manipulation. Four aberrant samples of 2,000 patterns were created for the aberrance misspecification condition. Here samples were created with 5 and 15 item per test spuriously high manipulations and with 5 and 15 item per test spuriously low manipulations. For three of the misspecifications, θ = (θ1, θ2) was sampled from the standardized bivariate normal distribution with correlation .7. The sampling of θ values in the misspecified ability density condition is described below. Note that there was no selection of response patterns as in Study One; all normal and aberrant response patterns were included. A separate analysis was conducted to evaluate each of the four forms of misspecification. In each case, correctly specified optimal indices were computed as well as incorrectly specified optimal indices.
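The "spuriously high" manipulation referred to above can be sketched as follows. This is a hypothetical implementation under stated assumptions (the report's exact manipulation is defined in Study One): k randomly chosen items in a response pattern are overwritten with the keyed response, and the pattern, key, and option coding are invented for illustration.

```python
# Sketch of the spuriously-high manipulation: force k randomly selected
# items to the keyed (correct) option, simulating cheating on k items.
import random

def make_spuriously_high(pattern, key, k, rng):
    """Return a copy of pattern with k randomly selected items forced correct."""
    out = list(pattern)
    for i in rng.sample(range(len(pattern)), k):
        out[i] = key[i]
    return out

rng = random.Random(7)
key = [1] * 20                                     # option 1 keyed correct everywhere
pattern = [rng.choice([0, 1, 2, 3]) for _ in range(20)]  # options coded 0-3
aberrant = make_spuriously_high(pattern, key, k=10, rng=rng)
print(sum(a == c for a, c in zip(aberrant, key)), "of 20 items now keyed correct")
```

A spuriously-low manipulation is the mirror image: the selected items are overwritten with a randomly chosen incorrect option instead of the keyed one.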
The first form of misspecification consisted of computing optimal indices with estimated ICCs and OCCs in place of the true ICCs and OCCs. To examine the effects of this substitution, the multilinear formula scoring ICCs and OCCs were used to simulate a test calibration sample of 3,000 response patterns. Then multilinear formula scoring ICCs and OCCs were estimated from this sample of 3,000 using the ForScore program, and three-parameter logistic ICCs were estimated with the BILOG program. Finally, optimal appropriateness indices were computed for the normal and aberrant response patterns described above using the correct ICCs and OCCs as well as the ICCs and OCCs estimated from the simulated calibration sample of 3,000. Note that the multilinear formula scoring ICCs and OCCs estimated from the simulation sample of 3,000 response patterns differ from the simulation ICCs and OCCs only to the extent of estimation error. In contrast, the three-parameter logistic ICCs estimated from the sample of 3,000 differ from the simulation ICCs both because of estimation errors and because the true ICCs were not exactly three-parameter logistic. It seemed reasonable to incorporate this latter type of misspecification for the three-parameter logistic model because ICCs are not necessarily correctly modelled by curves in the three-parameter logistic family. The second form of misspecification investigated in Study Two consisted of violations of local independence. As described previously, item responses were generated to simulate a two-dimensional test in which the two latent traits had a correlation of .7. The misspecified optimal indices made the incorrect assumption that the entire item pool of 109 items was unidimensional. Thus, optimal indices for a single long unidimensional test were computed in the misspecification condition; the correctly specified multi-test optimal indices were also computed. A misspecified ability density was the third form of misspecification studied.
In earlier research (e.g., Drasgow, Levine, McLaughlin, & Earles, 1987), the ability density f(·) in Equation 5 has been taken as the standard normal. This density is undoubtedly incorrect for a population of examinees when there has been self-selection or some other selection prior to administration of the exam (e.g., when recruiters prescreen applicants). To simulate ability density misspecification, two numbers X and Y were sampled from a truncated chi-square distribution with 10 degrees of freedom (the bottom .01% and the extreme upper tail of the distribution were discarded because the multilinear formula scoring ICCs and OCCs were defined only for θs less than 3 in absolute value). Then θ1 was taken as [X − E(X)]/√Var(X) (i.e., a standardized version of the truncated chi-square). The density of θ1 is shown in Figure 2, along with the standard normal density. θ2 was constructed by first standardizing Y and then computing θ2 = aθ1 + (1 − a)z, where z is the standardized Y and a = .4995 was chosen so that θ1 and θ2 had a correlation of .7. Finally, misspecified optimal indices were computed with the incorrect assumption that (θ1, θ2) was sampled from the standardized bivariate normal distribution with correlation .7. Correctly specified optimal indices were also computed.

The final misspecification concerned the number of aberrant responses made by an examinee. Test administrators ordinarily do not know how many item responses might be aberrant. To evaluate the performances of optimal indices under these conditions, response patterns with 5 or 15 aberrant responses per test were created, and then the optimal index for 10 aberrant responses per test was computed as well as the correctly specified optimal index.

Results

True versus estimated ICCs and OCCs. Table 5 presents selected detection rates of spuriously high and spuriously low response patterns for the ICC and OCC misspecification condition.
From this table it is evident that only minimal reduction in detection rates occurred as a result of estimation error. The greatest shrinkage was expected for the polychotomous MFS analysis; here the detection rates for optimal indices computed with true and estimated ICCs and OCCs were 85% and 82% in the spuriously low condition and 39% and 36% in the spuriously high condition when the false positive rate was 5%. This small amount of shrinkage clearly indicates that the effects of the estimation errors obtained with a calibration sample of 3,000 were generally inconsequential. There is one discrepant value in Table 5: when the false positive rate was .001, the detection rate for the polychotomous MFS multi-test optimal index was much lower for estimated ICCs and OCCs in the spuriously low condition. Although this result may be due to errors of estimation of the ICCs and OCCs, it may also be due to the fact that Table 5 presents empirical detection rates (i.e., the numbers in Table 5 would be different if we replicated our analysis with a different seed for the random number generator). The cutting score for classification is determined from only 4 normal response patterns when the false positive rate is .001; this cutting score is likely to have considerable sampling error. Very little decrement in detection rates is evident in the dichotomous MFS analysis. This finding corroborates results obtained by Levine, Drasgow, Williams, McCusker, and Thomasson (under review), who found very small estimation errors with their "ideal observer" methodology (i.e., an observer who uses an optimal statistical procedure to distinguish response patterns generated from true versus estimated ICCs). Finally, the detection rates for the estimated three-parameter logistic ICCs are nearly as high as the rates for the dichotomous analysis with the true multilinear formula scoring ICCs.
From this finding it appears that the joint effects of estimation errors and departures from the three-parameter logistic parametric form were generally inconsequential. Note, however, that the detection rates for both dichotomous analyses fall short of the polychotomous model detection rates. These differences are especially large for the spuriously low response patterns.

Table 5. Selected Rates of Detection of Aberrant Response Patterns by the
Likelihood Ratio Evaluated with True and Estimated Item Parameters

 False Pos.        Polychot. MFS     Dichot. MFS      3PL
   Rate     Test    True    Est.     True    Est.     Est.

 10 Spuriously Low Responses Per Test
   .001      V       29      26       23      22       19
             Q       18      13       06      06       07
             MT      41      16       23      25       20
   .01       V       56      51       38      38       38
             Q       28      27       13      14       15
             MT      69      64       47      47       43
   .03       V       68      65       52      52       51
             Q       40      37       23      23       22
             MT      80      76       59      58       59
   .05       V       75      73       59      58       57
             Q       48      46       29      28       28
             MT      85      82       66      65       64
   .10       V       85      84       71      70       69
             Q       63      60       40      39       37
             MT      91      90       77      76       75

 10 Spuriously High Responses Per Test
   .001      V       02      02       01      01       02
             Q       03      02       02      02       03
             MT      05      05       03      05       04
   .01       V       07      07       06      06       06
             Q       12      12       09      10       10
             MT      19      18       13      13       14
   .03       V       14      14       12      12       13
             Q       24      22       19      19       18
             MT      32      29       25      24       25
   .05       V       19      18       17      15       18
             Q       31      28       26      26       23
             MT      39      36       31      30       32
   .10       V       30      28       27      26       27
             Q       43      40       39      37       36
             MT      50      46       43      43       45

Dimensionality misspecification. Table 6 presents results for the misspecification condition in which two-dimensional item responses are analyzed with a one-dimensional model. Results for the correctly specified multi-test analyses are given beneath the columns headed MT. Substantial drops in rates of detection of both spuriously high and spuriously low response patterns are apparent for all three types of analyses.
For example, when the false positive rate is .03 there was a 17% decrease in the rate of detection of spuriously low response patterns by the polychotomous MFS analysis (i.e., 80% detection in the correct analysis versus 63% in the misspecified analysis), and there were 18% decreases for the dichotomous MFS analysis and the three-parameter logistic analysis. A similar pattern of results occurs for the spuriously high response patterns. The detection rates shown in Table 6 indicate that optimal appropriateness measurement is affected by serious violations of unidimensionality. Specifically, it is clear that detection rates are markedly decreased by combining the simulated verbal and quantitative tests and then performing a unidimensional analysis. This finding underscores the importance of earlier research that developed optimal multi-test appropriateness indices (Drasgow, Levine, & McLaughlin, in press; Levine, in preparation).

Misspecified ability densities. Table 7 presents the results for the response patterns created with ability parameters obtained from truncated chi-square distributions but analyzed with the incorrect assumption that the ability distribution was bivariate normal. A very high degree of robustness to this form of misspecification can be seen in Table 7 for all item response models and both types of aberrant response patterns. The robustness to ability density misspecification is a result of the equations for the marginal likelihood of a response pattern given in Equations 3 and 4. From these equations it can be seen that the marginal likelihood is the integral of the product of the conditional likelihood of the response pattern and the ability density. For tests of moderate length or longer, the ability density is ordinarily very flat in relation to the conditional likelihood. For example, the maximum of the normal density is about eight times larger than the minimum density on the interval [-2, 2].
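This flatness contrast can be checked numerically. The sketch below is an illustration under stated assumptions, not the report's computation: it uses a fabricated 50-item test of equivalent 3PL items and compares how much the standard normal density and the conditional likelihood each vary over θ in [-2, 2].

```python
# Flatness check: over theta in [-2, 2] the normal density varies by a
# single-digit factor, while the conditional likelihood of a 50-item
# response pattern (30 right, 20 wrong) varies by many orders of magnitude.
import math

def p3pl(theta, a=1.0, b=0.0, c=0.2):
    """Fabricated three-parameter logistic ICC."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

def cond_loglik(theta, right=30, wrong=20):
    """Log conditional likelihood of 30 correct / 20 incorrect responses."""
    p = p3pl(theta)
    return right * math.log(p) + wrong * math.log(1 - p)

grid = [-2 + 0.05 * i for i in range(81)]        # theta in [-2, 2]
dens = [math.exp(-t * t / 2) for t in grid]      # unnormalized N(0, 1)
dens_ratio = max(dens) / min(dens)
ll = [cond_loglik(t) for t in grid]
lik_ratio_log10 = (max(ll) - min(ll)) / math.log(10)
print(f"density max/min ~ {dens_ratio:.1f}; likelihood max/min ~ 10^{lik_ratio_log10:.0f}")
```

Because the likelihood's variation dwarfs the density's, the integrand, and hence the marginal likelihood, is dominated by the conditional likelihood, which is why swapping in a misspecified density changes the indices so little.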
In contrast, the maximum of the conditional likelihood may be larger than its minimum on the same interval by many orders of magnitude (Levine & Drasgow, 1988, p. 170). Consequently, the value of the integral is determined primarily by the conditional likelihood function for tests as long as the verbal and quantitative tests simulated here.

Incorrect specification of the number of aberrant responses. The results for the final form of misspecification are given in Table 8. Here response patterns were generated with either 5 or 15 aberrant responses per test; optimal indices were then computed with the correct assumption about the number of aberrant responses or with the incorrect assumption that 10 items per test were aberrant.

Table 6. Selected Rates of Detection of Aberrant Response Patterns by the
Likelihood Ratio with Correct and Incorrect Assumptions about Dimensionality

 False Pos.    Polychot. MFS      Dichot. MFS        3PL
   Rate        MT   One Test      MT   One Test      MT   One Test

 Data Generated with 10 Spuriously Low Responses Per Test
   .001        41      11         23      05         24      05
   .01         69      47         47      20         43      19
   .03         80      63         59      41         54      36
   .05         85      73         66      51         62      46
   .10         91      85         77      67         73      62

 Data Generated with 10 Spuriously High Responses Per Test
   .001        05      00         05      02         05      00
   .01         19      04         15      05         16      04
   .03         32      14         25      13         25      13
   .05         39      22         32      20         32      20
   .10         50      36         46      32         46      33

Table 7. Selected Rates of Detection of Aberrant Response Patterns by the
Likelihood Ratio Evaluated with Correct and Misspecified Ability Densities
 False Pos.        Polychot. MFS       Dichot. MFS          3PL
   Rate     Test   Correct  Misspec.   Correct  Misspec.   Correct  Misspec.

 10 Spuriously Low Responses Per Test
   .001      V       31       31         20       19         19       18
             Q       13       11         04       03         04       04
             MT      38       42         20       19         22       20
   .01       V       53       54         35       34         35       35
             Q       26       26         10       10         12       11
             MT      66       67         43       41         43       42
   .03       V       64       64         49       49         48       47
             Q       40       39         19       19         21       19
             MT      81       80         58       57         57       56
   .05       V       73       74         56       57         56       54
             Q       48       47         24       25         27       26
             MT      86       86         67       64         66       64
   .10       V       83       83         70       69         68       67
             Q       61       61         38       38         40       39
             MT      94       93         78       78         77       77

 10 Spuriously High Responses Per Test
   .001      V       02       00         02       02         01       01
             Q       02       01         02       01         01       01
             MT      05       00         02       01         03       03
   .01       V       08       07         07       06         05       05
             Q       11       08         11       08         10       10
             MT      18       14         14       12         13       12
   .03       V       15       15         14       14         11       11
             Q       22       21         21       20         21       19
             MT      31       28         28       26         24       26
   .05       V       21       21         18       18         17       17
             Q       32       30         28       26         26       26
             MT      39       38         34       33         31       30
   .10       V       31       31         28       27         26       25
             Q       44       42         40       40         38       36
             MT      54       52         49       48         46       45

Table 8. Selected Rates of Detection by the Likelihood Ratio with Correct and
Incorrect Specifications of the Number of Aberrant Responses

 False Pos.       Polychot. MFS      Dichot. MFS        3PL
   Rate     Test  (assumed k)        (assumed k)        (assumed k)

 Data Generated with 5 Spuriously Low Responses Per Test
                   k=5    k=10       k=5    k=10       k=5    k=10
   .001      V      19     17         08     07         07     06
             Q      09     09         01     00         02     01
             MT     18     15         11     05         09     05
   .01       V      32     27         21     16         19     15
             Q      15     12         07     06         08     06
             MT     41     31         26     17         25     17
   .03       V      44     37         32     26         29     23
             Q      23     21         12     10         13     11
             MT     53     44         37     28         33     26
   .05       V      51     43         39     33         35     29
             Q      29     27         16     15         17     14
             MT     59     51         42     35         40     32
   .10       V      62     57         44     42         48     42
             Q      39     37         25     22         26     23
             MT     68     64         54     48         52     45

 Data Generated with 15 Spuriously Low Responses Per Test
                   k=10   k=15       k=10   k=15       k=10   k=15
   .001      V      45     45         22     26         23     28
             Q      19     23         06     07         09     09
             MT     53     57         25     31         31     34
   .01       V      65     69         48     52         43     48
             Q      41     42         18     21         20     19
             MT     83     87         60     65         53     56
   .03       V      81     83         64     66         58     61
             Q      55     58         32     32         28     30
             MT     91     92         72     75         67     69
   .05       V      86     87         70     73         67     69
             Q      65     67         39     39         37     38
             MT     94     94         79     80         75     78
   .10       V      93     93         82     84         79     80
             Q      76     77         51     54         48     50
             MT     97     98         89     91         86     87
Table 8 (concluded)

 False Pos.       Polychot. MFS      Dichot. MFS        3PL
   Rate     Test  (assumed k)        (assumed k)        (assumed k)

 Data Generated with 5 Spuriously High Responses Per Test
                   k=5    k=10       k=5    k=10       k=5    k=10
   .001      V      00     00         01     01         00     00
             Q      00     00         00     01         00     00
             MT     00     00         01     01         00     00
   .01       V      03     03         03     03         03     03
             Q      05     04         04     04         04     04
             MT     07     06         07     06         05     05
   .03       V      08     08         08     08         06     06
             Q      11     10         10     08         10     08
             MT     15     13         13     11         12     11
   .05       V      12     12         11     11         10     10
             Q      15     14         13     13         14     13
             MT     19     18         16     16         17     15
   .10       V      22     20         19     19         17     17
             Q      24     22         24     22         22     21
             MT     30     27         28     26         27     24

 Data Generated with 15 Spuriously High Responses Per Test
                   k=10   k=15       k=10   k=15       k=10   k=15
   .001      V      05     06         03     04         03     04
             Q      09     12         03     11         07     07
             MT     11     07         06     14         09     10
   .01       V      13     14         09     09         12     13
             Q      24     26         16     19         16     20
             MT     35     37         23     27         26     29
   .03       V      22     22         17     16         21     21
             Q      39     40         28     31         27     31
             MT     48     51         38     41         38     41
   .05       V      28     30         22     23         26     27
             Q      45     48         37     38         35     39
             MT     55     59         46     48         44     48
   .10       V      40     42         35     35         36     36
             Q      56     60         51     52         49     52
             MT     68     72         59     60         58     62

Surprisingly modest drops in detection rates were obtained for this form of misspecification. An examination of Table 8 indicates that the least robustness occurred for the response patterns generated with five spuriously low responses per test. At a 5% false positive rate, the drops in detection rates were just 8% for the polychotomous MFS model, 7% for the dichotomous MFS model, and 8% for the 3PL model. Although further analyses would be needed to corroborate this observation, it appears from Table 8 that a greater degree of robustness is obtained when a response pattern is analyzed with a misspecified number of aberrant responses that is smaller than the actual number of aberrant responses. The converse analysis, in which the misspecified number of aberrant responses is larger than the actual number, yielded somewhat larger drops in detection rates.

IV. CONCLUSIONS AND DISCUSSION

The major purpose of the research described in this paper was to explore the possibility of using optimal appropriateness indices to address practical testing problems.
To this end, it was shown that existing algorithms for evaluating optimal indices could be tailored to a specific problem (i.e., testing the hypothesis that a response pattern with a total test score in a narrow range was obtained honestly against the hypothesis that it was obtained dishonestly), and the performance of the resulting optimal test was evaluated. An interrelated set of simulations was also conducted to examine the robustness of optimal tests to violations of assumptions. There can be little doubt that some examinees may be tempted to cheat when valued outcomes are contingent upon obtaining a test score exceeding some cutoff value. Moreover, the use of cutoffs to determine the allocation of valued outcomes is very common: recruitment bonuses, minimum qualification for military enlistment, professional licensing (e.g., nursing and attorneys' bar examinations), certification, and state and local public sector hiring. A way that test administrators can combat cheating has been described in this paper. The statistic given in Equation 8 provides a most powerful test of the hypothesis that an examinee obtained a score barely exceeding some cutoff by honest means against the alternative hypothesis that the barely passing score was obtained by cheating on k items. Of course, the optimal appropriateness index cannot replace careful proctoring during exam administration, routine replacement of old test forms with new test forms, and other security measures. Nonetheless, it does give the test administrator an additional method for identifying cheating. Moreover, test takers may be dissuaded from attempting to cheat if they know that their responses will be examined for indications of cheating. Tables 1 through 4 give rates of detection of simulated cheaters who obtained scores in moderately low (20th through 24th percentile) or just above average (50th through 54th percentile) score ranges. The results given in these tables provide news that is both bad and good.
The bad news is that it is very difficult to distinguish between normal response patterns with test scores in a narrow score range and patterns from examinees who cheated on a few items (5 or 10 per test) in order to obtain test scores in the same range. This result is not too surprising because some of the honest examinees obtained test scores in the given score range by chance rather than merit. Specifically, consider a plot of the frequency distribution of θ or true score for people with observed scores between, say, the 50th and 54th percentiles for some unidimensional test. We would observe many people with θs or true scores that fall outside the 50th through 54th percentiles. The point is that restricting observed scores to lie within some percentile range does not guarantee that θs or true scores will fall in the same percentile range. Some lower ability examinees obtained test scores in the score range because they were lucky, and some higher ability examinees obtained test scores in the score range because they were unlucky. Given just a response pattern, the effects of "luck" (i.e., a few extra correct responses) and the effects of cheating on a few items (again, a few extra correct responses) are very difficult to differentiate. Some of the cheaters have θs in or even above the percentile range. Others have θs just below the percentile range and would therefore have close to a 50% chance of obtaining an observed score in the percentile range if they were retested with a different test form. In sum, there is little practical need to identify cheaters with θs that are close to or in the percentile range, although ethical and policy considerations may deem otherwise. Turning now to the good news from Study One, Tables 1 through 4 show that it is possible to identify simulated cheating on a relatively large number of items. For the lower test score range, reasonably high rates of detection were obtained with simulated cheating on 10 items per test.
Fairly good detection rates were also obtained with cheating on 15 items per test for the just-above-average score range. Identifying individuals who cheat on a large number of items is particularly important because these people have θs that are far below those of noncheaters.

The results obtained in Study Two clearly suggest that optimal indices can be used effectively in applied settings. Only one form of model misspecification substantially decreased detection rates. This type of misspecification would occur if a test administrator were to combine a verbal test and a quantitative test and treat the composite as a long unidimensional test. Such an event, perhaps based on the argument that typical paper-and-pencil tests are "highly g saturated," would seriously undermine attempts to identify aberrant response patterns. Instead, multi-test optimal appropriateness indices (Drasgow, Levine, & McLaughlin, in press) should be computed because they provide far more effective identification of aberrance in the context of a battery of several unidimensional tests.

Three other forms of misspecification were found to have little or no effect on detection rates in Study Two. Perhaps the most important of these three concerns item parameter estimation errors. In a practical setting, there is never access to the "true" item parameters; at best there are only item parameters estimated from data provided by a large and representative sample. Table 5 shows that there was little decrement in detection rates due to estimation errors for either MFS estimation or 3PL estimation. These results corroborate and extend earlier research on MFS estimation via the ForScore computer program (Drasgow, Levine, Williams, McLaughlin, & Candell, in press; Lim et al., 1989; Williams & Levine, 1984, 1986) and 3PL estimation with the BILOG computer program (Levine et al., under review; Lim & Drasgow, in press; Mislevy, 1986; Mislevy & Stocking, 1989).
It was thus concluded that estimated item parameters can be used effectively in place of the true parameters, provided that the estimates were obtained from a large, representative sample.

Table 7 shows that even a rather badly misspecified ability density has little effect on detection rates, at least for tests of the length simulated in Study Two (50 and 54 items) and the one ability density in this study. This result is convenient because it means that test administrators do not need to be concerned with density estimation. Misspecified ability densities may have a significant effect on shorter tests, where the ability density exhibits considerable variation relative to the likelihood function. In such cases it may be necessary to estimate the ability density (see, for example, Levine, 1989a; Mislevy, 1984; or Samejima, 1981).

The final form of misspecification concerned the number of aberrant responses. Table 8 presents the surprising result that an analysis assuming 10 spuriously low responses per test for response patterns that actually had 5 or 15 spuriously low responses per test was almost as effective as the truly optimal analysis. A similar finding was obtained for spuriously high responses. These results provide a contrast between longer, paper-and-pencil tests and short computerized adaptive tests (CATs): Candell and Levine (1989) found larger drops in detection rates when the number of aberrant responses was misspecified on a 15-item CAT.

The results from Studies One and Two lead to the following suggestion for the use of appropriateness measurement in an applied setting. First, the test administrator should make a judgment about the minimum number k of spuriously high or spuriously low responses that is needed in order to constitute a nontrivial practical problem. An optimal appropriateness index could then be computed assuming k aberrant responses, perhaps using existing algorithms and software.
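The flagging step of this suggestion, calibrating a cut score so that only a small, acceptable fraction of honest examinees is falsely flagged, can be sketched as follows. The index scores here are synthetic stand-ins (honest scores standard normal, a small fraction of aberrant patterns shifted upward); in practice they would be optimal appropriateness indices computed for each response pattern.

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in index scores. In practice: indices computed for a large
# calibration sample known (or assumed) to be honest, and for the
# operational examinees being screened.
honest_calibration = rng.standard_normal(10000)
observed = rng.standard_normal(500) + np.where(rng.random(500) < 0.05, 3.0, 0.0)

alpha = 0.01  # acceptable false positive rate chosen by the administrator
threshold = np.quantile(honest_calibration, 1.0 - alpha)

flagged = observed > threshold
print(f"threshold = {threshold:.2f}; flagged {flagged.sum()} of {len(observed)}")
```

Because the threshold is an empirical quantile of honest-examinee index scores, roughly alpha of honest examinees will be flagged and retested; the detection rate among aberrant examinees then depends on how far their index distribution is shifted, as in Tables 1 through 4.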
Finally, response patterns with index scores that exceed a threshold associated with some acceptable false positive rate could be flagged, and the examinees retested. Implicit in the above suggestion is the need for item parameters estimated from a large and representative sample. The suggestion also builds on the misspecification analyses, which found ability density misspecification to be unimportant and found robustness to misspecification of the number of aberrant responses.

Overall, the utilization of appropriateness indices, perhaps in the manner outlined above, would be expected to improve the quality of a testing program. It would allow identification of some response patterns with modest degrees of aberrance and effective detection of patterns with substantial degrees of aberrance, and might thereby deter cheating. It would provide individual test takers with some assurance that their aptitudes had been accurately measured. For these reasons it is recommended that testing programs seriously consider implementing appropriateness measurement.

REFERENCES

Candell, G. L., & Levine, M. V. (1989). Appropriateness measurement for computerized adaptive tests (AFHRL-TP-89-15). Brooks AFB, TX: Manpower and Personnel Division, Air Force Human Resources Laboratory.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59-79.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (in press). Multi-test extensions of practical and optimal appropriateness indices. Applied Psychological Measurement.

Drasgow, F., Levine, M. V., McLaughlin, M. E., & Earles, J. A. (1987). Appropriateness measurement (AFHRL-TP-87-6, AD-A184185). Brooks AFB, TX: Manpower and Personnel Division, Air Force Human Resources Laboratory.

Drasgow, F., Levine, M. V., & Williams, E. A. (1985).
Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.

Drasgow, F., Levine, M. V., Williams, B., McLaughlin, M. E., & Candell, G. L. (in press). Modeling incorrect responses to multiple-choice items with Multilinear Formula Score theory. Applied Psychological Measurement, 12.

Levine, M. V. (1985). Classifying and representing ability distributions (Measurement Series 85-1). Champaign, IL: University of Illinois, Department of Educational Psychology.

Levine, M. V. (1989a). Classifying and representing ability distributions (Measurement Series 89-1). Champaign, IL: University of Illinois, Department of Educational Psychology.

Levine, M. V. (1989b). Parameterizing patterns (Measurement Series 89-2). Champaign, IL: University of Illinois, Department of Educational Psychology.

Levine, M. V. (in preparation). Properties of likelihoods of response patterns for short and long tests.

Levine, M. V., & Drasgow, F. (1988). Optimal appropriateness measurement. Psychometrika, 53, 161-176.

Levine, M. V., Drasgow, F., Williams, B., McCusker, C., & Thomasson, G. L. (under review). Distinguishing between item response theory models.

Levine, M. V., & Rubin, D. B. (1979). Measuring the appropriateness of multiple-choice test scores. Journal of Educational Statistics, 4, 269-289.

Lim, R. G., & Drasgow, F. (in press). An evaluation of two methods for estimating item response theory parameters when assessing differential item functioning. Journal of Applied Psychology.

Lim, R. G., Williams, B., McCusker, C., Mead, A., Thomasson, G. L., Drasgow, F., & Levine, M. V. (1989). A nonparametric polychotomous model and estimation procedure. Paper presented at the 1989 Office of Naval Research Contractors' Meeting on Model-Based Psychological Measurement, Norman, OK.

Mislevy, R. J. (1984). Estimating latent distributions. Psychometrika, 49, 359-382.

Mislevy, R. J.
(1986). Bayes modal estimation in item response models. Psychometrika, 51, 177-195.

Mislevy, R. J., & Bock, R. D. (1984). BILOG II user's guide. Mooresville, IN: Scientific Software.

Mislevy, R. J., & Stocking, M. L. (1989). A consumer's guide to LOGIST and BILOG. Applied Psychological Measurement, 13, 57-75.

Rudner, L. M. (1983). Individual assessment accuracy. Journal of Educational Measurement, 20, 207-219.

Samejima, F. (1981). Final report: Efficient methods of estimating the operating characteristics of item response categories and challenge to a new model for the multiple-choice item (Technical Report). Knoxville, TN: University of Tennessee, Department of Psychology.

Sato, T. (1975). The construction and interpretation of S-P tables (in Japanese). Tokyo: Meiji Tosha.

Tatsuoka, K. K. (1984). Caution indices based on item response theory. Psychometrika, 49, 95-110.

Williams, B., & Levine, M. V. (1984). Maximum likelihood for qualitative models. Paper presented at the 1984 Office of Naval Research Contractors' Meeting on Model-Based Psychological Measurement, Princeton, NJ.

Williams, B., & Levine, M. V. (1986). The shapes of item response functions. Paper presented at the 1986 Office of Naval Research Contractors' Meeting on Model-Based Psychological Measurement, Gatlinburg, TN.

Williams, B., & Levine, M. V. (in preparation). ForScore: A computer program for nonparametric item response theory.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14, 97-116.